Lahmann’s Analysis (Only Going to Concentrate on Certain Time Frame)

I have chosen to do an analysis about the history of baseball because I would want to have a broader understanding of the sport and to have more diverse topics. For this project, we are going to perform analysis in different divisions to determine if there are correlations with attendance, performance, salary, and size.

Loading the Libraries

library(knitr)
library(MASS)
library(tibble)
library(markdown)
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(dplyr)
library(lubridate)
library(broom)
library(dotwhisker)
library(cem)
library(car)
library(plyr)
library(tidyr)
library(scales)
library(plotly)
library(DT)

Reading the Datasets Based on Attendance, Pitching, Player Information, Salary, and Batting

attendance <- read.csv("./teams.csv",na.strings=c("","NA"))
pitching <- read.csv("./pitching.csv")
player_info <- read.csv("./player.csv")
player_salary <- read.csv("./salaries.csv")
batters <- read.csv("./batting.csv", stringsAsFactors = FALSE)
fielding <- read_csv("./fielding.csv",col_types = cols())

Looking through the First Set of Rows on Each Dataset

head(attendance, 5)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871 <NA>    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871 <NA>    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871 <NA>    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871 <NA>    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871 <NA>    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 NA  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 NA  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18 NA  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16 NA  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 NA  NA NA 313 121 3.72 32   1  0
##   IPouts  HA HRA BBA SOA   E DP   FP                    name
## 1    828 367   2  42  23 225 NA 0.83    Boston Red Stockings
## 2    753 308   6  28  22 218 NA 0.82 Chicago White Stockings
## 3    762 346  13  53  34 223 NA 0.81  Cleveland Forest Citys
## 4    507 261   5  21  17 163 NA 0.80    Fort Wayne Kekiongas
## 5    879 373   7  42  22 227 NA 0.83        New York Mutuals
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
head(pitching, 5)
##    playerID yearID stint teamID lgID  W  L  G GS CG SHO SV IPouts   H  ER
## 1 bechtge01   1871     1    PH1 <NA>  1  2  3  3  2   0  0     78  43  23
## 2 brainas01   1871     1    WS3 <NA> 12 15 30 30 30   0  0    792 361 132
## 3 fergubo01   1871     1    NY2 <NA>  0  0  1  0  0   0  0      3   8   3
## 4 fishech01   1871     1    RC1 <NA>  4 16 24 24 22   1  0    639 295 103
## 5 fleetfr01   1871     1    NY2 <NA>  0  1  1  1  1   0  0     27  20  10
##   HR BB SO BAOpp   ERA IBB WP HBP BK BFP GF   R SH SF GIDP
## 1  0 11  1    NA  7.96  NA NA  NA  0  NA NA  42 NA NA   NA
## 2  4 37 13    NA  4.50  NA NA  NA  0  NA NA 292 NA NA   NA
## 3  0  0  0    NA 27.00  NA NA  NA  0  NA NA   9 NA NA   NA
## 4  3 31 15    NA  4.35  NA NA  NA  0  NA NA 257 NA NA   NA
## 5  0  3  0    NA 10.00  NA NA  NA  0  NA NA  21 NA NA   NA
head(player_info, 5)
##    playerID birthYear birthMonth birthDay birthCountry birthState
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
##    birthCity deathYear deathMonth deathDay deathCountry deathState
## 1     Denver        NA         NA       NA                        
## 2     Mobile        NA         NA       NA                        
## 3     Mobile      1984          8       16          USA         GA
## 4     Orange        NA         NA       NA                        
## 5 Palm Beach        NA         NA       NA                        
##   deathCity nameFirst nameLast      nameGiven weight height bats throws
## 1               David  Aardsma    David Allan    220     75    R      R
## 2                Hank    Aaron    Henry Louis    180     72    R      R
## 3   Atlanta    Tommie    Aaron     Tommie Lee    190     75    R      R
## 4                 Don     Aase Donald William    190     75    R      R
## 5                Andy     Abad  Fausto Andres    184     73    L      L
##        debut  finalGame  retroID   bbrefID
## 1 2004-04-06 2015-08-23 aardd001 aardsda01
## 2 1954-04-13 1976-10-03 aaroh101 aaronha01
## 3 1962-04-10 1971-09-26 aarot101 aaronto01
## 4 1977-07-26 1990-10-03 aased001  aasedo01
## 5 2001-09-10 2006-04-13 abada001  abadan01
head(player_salary, 5)
##   yearID teamID lgID  playerID salary
## 1   1985    ATL   NL barkele01 870000
## 2   1985    ATL   NL bedrost01 550000
## 3   1985    ATL   NL benedbr01 545000
## 4   1985    ATL   NL  campri01 633333
## 5   1985    ATL   NL ceronri01 625000
head(batters, 5)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO <NA>  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1 <NA> 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1 <NA> 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3 <NA> 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1 <NA> 25 120 29 39  11   3  0  16  6  2  2
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
head(fielding, 5)
## # A tibble: 5 x 18
##    playerID yearID stint teamID  lgID   POS     G    GS InnOuts    PO
##       <chr>  <int> <int>  <chr> <chr> <chr> <int> <chr>   <chr> <int>
## 1 abercda01   1871     1    TRO  <NA>    SS     1  <NA>    <NA>     1
## 2  addybo01   1871     1    RC1  <NA>    2B    22  <NA>    <NA>    67
## 3  addybo01   1871     1    RC1  <NA>    SS     3  <NA>    <NA>     8
## 4 allisar01   1871     1    CL1  <NA>    2B     2  <NA>    <NA>     1
## 5 allisar01   1871     1    CL1  <NA>    OF    29  <NA>    <NA>    51
## # ... with 8 more variables: A <int>, E <int>, DP <int>, PB <int>,
## #   WP <chr>, SB <chr>, CS <chr>, ZR <chr>

Looking at the Structure of the Dataset

str(attendance)
## 'data.frame':    2805 obs. of  48 variables:
##  $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
##  $ lgID          : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 90 97 111 136 142 8 ...
##  $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 13 36 25 56 70 85 91 109 77 9 ...
##  $ divID         : Factor w/ 3 levels "C","E","W": NA NA NA NA NA NA NA NA NA NA ...
##  $ Rank          : int  3 2 8 7 5 1 9 6 4 2 ...
##  $ G             : int  31 28 29 19 33 28 25 29 32 58 ...
##  $ Ghome         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ W             : int  20 19 10 7 16 21 4 13 15 35 ...
##  $ L             : int  10 9 19 12 17 7 21 15 15 19 ...
##  $ DivWin        : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
##  $ WCWin         : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
##  $ LgWin         : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 1 1 1 ...
##  $ WSWin         : Factor w/ 2 levels "N","Y": NA NA NA NA NA NA NA NA NA NA ...
##  $ R             : int  401 302 249 137 302 376 231 351 310 617 ...
##  $ AB            : int  1372 1196 1186 746 1404 1281 1036 1248 1353 2576 ...
##  $ H             : int  426 323 328 178 403 410 274 384 375 747 ...
##  $ X2B           : int  70 52 35 19 43 66 44 51 54 94 ...
##  $ X3B           : int  37 21 40 8 21 27 25 34 26 35 ...
##  $ HR            : int  3 10 7 2 1 9 3 6 6 14 ...
##  $ BB            : int  60 60 26 33 33 46 38 49 48 27 ...
##  $ SO            : int  19 22 25 9 15 23 30 19 13 28 ...
##  $ SB            : int  73 69 18 16 46 56 53 62 48 35 ...
##  $ CS            : int  NA NA NA NA NA NA NA NA NA 15 ...
##  $ HBP           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RA            : int  303 241 341 243 313 266 287 362 303 434 ...
##  $ ER            : int  109 77 116 97 121 137 108 153 137 173 ...
##  $ ERA           : num  3.55 2.76 4.11 5.17 3.72 4.95 4.3 5.51 4.37 3.02 ...
##  $ CG            : int  22 25 23 19 32 27 23 28 32 48 ...
##  $ SHO           : int  1 0 0 1 1 0 1 0 0 1 ...
##  $ SV            : int  3 1 0 0 0 0 0 0 0 1 ...
##  $ IPouts        : int  828 753 762 507 879 747 678 750 846 1545 ...
##  $ HA            : int  367 308 346 261 373 329 315 431 371 566 ...
##  $ HRA           : int  2 6 13 5 7 3 3 4 4 3 ...
##  $ BBA           : int  42 28 53 21 42 53 34 75 45 63 ...
##  $ SOA           : int  23 22 34 17 22 16 16 12 13 0 ...
##  $ E             : int  225 218 223 163 227 194 220 198 217 432 ...
##  $ DP            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ FP            : num  0.83 0.82 0.81 0.8 0.83 0.84 0.82 0.84 0.85 0.82 ...
##  $ name          : Factor w/ 139 levels "Altoona Mountain City",..: 17 42 51 63 93 97 111 131 135 5 ...
##  $ park          : Factor w/ 212 levels "23rd Street Grounds",..: 169 196 115 76 198 86 3 79 129 123 ...
##  $ attendance    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPF           : int  103 104 96 101 90 102 97 101 94 106 ...
##  $ PPF           : int  98 102 100 107 88 98 99 100 98 102 ...
##  $ teamIDBR      : Factor w/ 101 levels "ALT","ANA","ARI",..: 10 21 25 42 62 4 77 93 65 6 ...
##  $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 24 31 39 56 89 96 110 135 140 8 ...
##  $ teamIDretro   : Factor w/ 149 levels "ALT","ANA","ARI",..: 24 31 39 56 89 96 110 135 141 8 ...
str(pitching)
## 'data.frame':    44139 obs. of  30 variables:
##  $ playerID: Factor w/ 9126 levels "aardsda01","aasedo01",..: 465 837 2502 2576 2608 2626 4968 5180 5255 5430 ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 142 90 111 90 136 111 56 97 136 ...
##  $ lgID    : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ W       : int  1 12 0 4 0 0 0 6 18 12 ...
##  $ L       : int  2 15 0 16 1 0 1 11 5 15 ...
##  $ G       : int  3 30 1 24 1 1 3 19 25 29 ...
##  $ GS      : int  3 30 0 24 1 0 1 19 25 29 ...
##  $ CG      : int  2 30 0 22 1 0 1 19 25 28 ...
##  $ SHO     : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ SV      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ IPouts  : int  78 792 3 639 27 3 39 507 666 747 ...
##  $ H       : int  43 361 8 295 20 1 20 261 285 430 ...
##  $ ER      : int  23 132 3 103 10 0 5 97 113 153 ...
##  $ HR      : int  0 4 0 3 0 0 0 5 3 4 ...
##  $ BB      : int  11 37 0 31 3 0 3 21 40 75 ...
##  $ SO      : int  1 13 0 15 0 0 1 17 15 12 ...
##  $ BAOpp   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ERA     : num  7.96 4.5 27 4.35 10 0 3.46 5.17 4.58 5.53 ...
##  $ IBB     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WP      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HBP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BK      : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ BFP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ R       : int  42 292 9 257 21 0 30 243 223 362 ...
##  $ SH      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GIDP    : int  NA NA NA NA NA NA NA NA NA NA ...
str(player_info)
## 'data.frame':    18846 obs. of  24 variables:
##  $ playerID    : Factor w/ 18846 levels "aardsda01","aaronha01",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ birthYear   : int  1981 1934 1939 1954 1972 1985 1854 1877 1869 1866 ...
##  $ birthMonth  : int  12 2 8 9 8 12 11 4 11 10 ...
##  $ birthDay    : int  27 5 5 8 25 17 4 15 11 14 ...
##  $ birthCountry: Factor w/ 53 levels "","Afghanistan",..: 50 50 50 50 50 18 50 50 50 50 ...
##  $ birthState  : Factor w/ 246 levels "","AB","Aberdeen",..: 44 6 6 30 69 108 173 173 229 148 ...
##  $ birthCity   : Factor w/ 4714 levels "","Aberdeen",..: 1093 2718 2718 3092 3159 2212 3279 2291 1337 1382 ...
##  $ deathYear   : int  NA NA 1984 NA NA NA 1905 1957 1962 1926 ...
##  $ deathMonth  : int  NA NA 8 NA NA NA 5 1 6 4 ...
##  $ deathDay    : int  NA NA 16 NA NA NA 17 6 11 27 ...
##  $ deathCountry: Factor w/ 24 levels "","American Samoa",..: 1 1 22 1 1 1 22 22 22 22 ...
##  $ deathState  : Factor w/ 93 levels "","AB","AK","AL",..: 1 1 26 1 1 1 57 25 88 12 ...
##  $ deathCity   : Factor w/ 2554 levels "","Aberdeen",..: 1 1 91 1 1 1 1735 758 462 1984 ...
##  $ nameFirst   : Factor w/ 2313 levels "","A. J.","Aaron",..: 513 918 2087 599 73 775 1148 649 158 335 ...
##  $ nameLast    : Factor w/ 9713 levels "Aardsma","Aaron",..: 1 2 2 3 4 4 5 6 7 7 ...
##  $ nameGiven   : Factor w/ 12437 levels "","A. Harry",..: 2491 5099 11411 2886 3732 3766 6604 3201 954 1714 ...
##  $ weight      : int  220 180 190 190 184 220 192 170 175 169 ...
##  $ height      : int  75 72 75 75 73 73 72 71 71 68 ...
##  $ bats        : Factor w/ 4 levels "","B","L","R": 4 4 4 4 3 3 4 4 4 3 ...
##  $ throws      : Factor w/ 3 levels "","L","R": 3 3 3 3 2 2 3 3 3 2 ...
##  $ debut       : Factor w/ 10037 levels "","1871-05-04",..: 8636 4698 5145 6222 8419 9383 106 1132 875 934 ...
##  $ finalGame   : Factor w/ 9029 levels "","1871-05-05",..: 8991 5859 5560 6745 7984 9028 103 1851 1078 1110 ...
##  $ retroID     : Factor w/ 18793 levels "","aardd001",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ bbrefID     : Factor w/ 18846 levels "","aardsda01",..: 2 3 4 5 6 7 8 9 10 11 ...
str(player_salary)
## 'data.frame':    25575 obs. of  5 variables:
##  $ yearID  : int  1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
##  $ teamID  : Factor w/ 35 levels "ANA","ARI","ATL",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lgID    : Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
##  $ playerID: Factor w/ 4963 levels "aardsda01","aasedo01",..: 229 288 324 689 795 805 1090 1447 1529 1854 ...
##  $ salary  : int  870000 550000 545000 633333 625000 800000 150000 483333 772000 250000 ...
str(batters)
## 'data.frame':    101332 obs. of  22 variables:
##  $ playerID: chr  "abercda01" "addybo01" "allisar01" "allisdo01" ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : chr  "TRO" "RC1" "CL1" "WS3" ...
##  $ lgID    : chr  NA NA NA NA ...
##  $ G       : int  1 25 29 27 25 12 1 31 1 18 ...
##  $ AB      : int  4 118 137 133 120 49 4 157 5 86 ...
##  $ R       : int  0 30 28 28 29 9 0 66 1 13 ...
##  $ H       : int  0 32 40 44 39 11 1 63 1 13 ...
##  $ X2B     : int  0 6 4 10 11 2 0 10 1 2 ...
##  $ X3B     : int  0 0 5 2 3 1 0 9 0 1 ...
##  $ HR      : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ RBI     : int  0 13 19 27 16 5 2 34 1 11 ...
##  $ SB      : int  0 8 3 1 6 0 0 11 0 1 ...
##  $ CS      : int  0 1 1 1 2 1 0 6 0 0 ...
##  $ BB      : int  0 4 2 0 2 0 1 13 0 0 ...
##  $ SO      : int  0 0 5 2 1 1 0 1 0 0 ...
##  $ IBB     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HBP     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SH      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ GIDP    : int  NA NA NA NA NA NA NA NA NA NA ...
str(fielding)
## Classes 'tbl_df', 'tbl' and 'data.frame':    170526 obs. of  18 variables:
##  $ playerID: chr  "abercda01" "addybo01" "addybo01" "allisar01" ...
##  $ yearID  : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : chr  "TRO" "RC1" "RC1" "CL1" ...
##  $ lgID    : chr  NA NA NA NA ...
##  $ POS     : chr  "SS" "2B" "SS" "2B" ...
##  $ G       : int  1 22 3 2 29 27 1 2 20 5 ...
##  $ GS      : chr  NA NA NA NA ...
##  $ InnOuts : chr  NA NA NA NA ...
##  $ PO      : int  1 67 8 1 51 68 7 3 38 10 ...
##  $ A       : int  3 72 14 4 3 15 0 4 52 0 ...
##  $ E       : int  2 42 7 0 7 20 0 1 28 8 ...
##  $ DP      : int  0 5 0 0 1 4 0 0 2 0 ...
##  $ PB      : int  NA NA NA NA NA 0 NA NA NA 0 ...
##  $ WP      : chr  NA NA NA NA ...
##  $ SB      : chr  NA NA NA NA ...
##  $ CS      : chr  NA NA NA NA ...
##  $ ZR      : chr  NA NA NA NA ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 18
##   .. ..$ playerID: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ yearID  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ stint   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ teamID  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ lgID    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ POS     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ G       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ GS      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ InnOuts : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ PO      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ A       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ E       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ DP      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PB      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ WP      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ SB      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CS      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ZR      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Cleaning the Data, More Cleaning Along the Way

Fixing the Attendance Data

attendance <- subset(attendance, attendance != 'NA')
attendance <- subset(attendance, yearID > 1969)

Fixing the Pitching Data

pitching <- subset(pitching, pitching  !='NA')
pitching[ , colSums(is.na(pitching)) == 0]
## data frame with 0 columns and 1197590 rows

Fixing the Batters Data

batters <- subset(batters, batters !='NA')
batters$SF <- as.integer(batters$SF)

Fixing the Player Info Data

player_info <- subset(player_info, player_info != 'NA')

Fixing the Fielding Data

fieldING <- subset(fielding, fielding != 'NA')
fielding = subset(fielding, yearID > 1920)
fieldING <- subset(fielding, E != 'NA')
names(player_salary) <- c("yearID","teamID","lgID","playerID","salary")

Add Salary to Batting and Pitching Information

batters <- dplyr::inner_join(batters,player_salary,by=c("playerID","teamID","lgID","yearID"))
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
## Warning: Column `teamID` joining character vector and factor, coercing into
## character vector
## Warning: Column `lgID` joining character vector and factor, coercing into
## character vector
pitching <- dplyr::inner_join(pitching,player_salary,by=c("playerID","teamID","lgID","yearID"))
## Warning: Column `playerID` joining factors with different levels, coercing
## to character vector
## Warning: Column `teamID` joining factors with different levels, coercing to
## character vector
## Warning: Column `lgID` joining factors with different levels, coercing to
## character vector

Get Rid of the Batters in the Pitching Data

batters <- dplyr::anti_join(batters,pitching,by="playerID")

Visualizing the Attendance for Baseball

ggplot(attendance, aes(yearID, attendance)) + geom_bar(stat="identity", aes(fill=G>157))+
  guides(fill=FALSE)+
  ggtitle("Total Home Attendance on an Annual Basis (Lockouts in 72, 81, 94, and 95)")+
  scale_y_continuous(labels = comma) +
  theme_minimal()

Product on the Field

product <- ggplot(attendance, aes(W, log10(attendance))) + geom_point(color="darkblue") + 
  labs(x="Wins", y="Attendance") +
  scale_y_continuous(labels = comma) +
  ggtitle("Wins Based on Attendance") +
  geom_smooth(method = "lm", col = "brown")

ggplotly(product)
hrattn <- ggplot(attendance, aes(HR, attendance)) + geom_point(color="brown") + 
  labs(x="Home Runs", y="Attendance") + 
  scale_y_continuous(labels = comma) +
  ggtitle("Home Runs Based on Attendance")+
  geom_smooth(method = "lm", col = "orange")

ggplotly(hrattn)

Despite winning 67 games in the 1993 season for Colorado Rockies, it has attracted about 4.5 million attendance. However, it was the first season into the franchise, so there has to be a fanbase for the team in order to attract many people to come. In comparison to the other end, there are some bad teams who’s seasons and attendance were low during the lockout seasons. The 1981 Chicago Cubs were stopped at 38-65 record with an attendance of 565,637 before the season was cut short. Many outliers on this graph. With the regression line, it has been determined that there are more data points being above the threshold.

Comparing them to the Losses on the Field

lossatn <- ggplot(attendance, aes(L, attendance)) + geom_point(color="magenta") + labs(x="Losses", y="Attendance") +
  scale_y_continuous(labels = comma) +
  ggtitle("Losses Based on Attendance")+
  geom_smooth(method = "lm", col = "green") +
  theme_minimal()
ggplotly(lossatn)

It seems that around 95 losses, there must be a lot of attendance, which is interesting. It does trace back to the 1993 season with Colorado Rockies. It is possible that it could be one of the well-known franchise teams that must have a bad season that is marred by many factors such as injuries, personal, or team conflict. Although, it is hard to predict. Lot more points above the threshold with the errors to attendance ratio. It has been determined that it is not a good model since r-squared is heading to the negative distribution.

Comparing them to the Errors

errors <- ggplot(attendance, aes(E, log10(attendance))) + geom_point(color="darkred") +   labs(x="Errors", y="Attendance") +
  scale_y_continuous(labels = comma) +
  ggtitle("Attendance Based on Errors")+
  geom_smooth(method = "lm", col = "mistyrose")

ggplotly(errors)

Joining the Moneyball Data

salary <- select(player_salary, yearID, teamID, salary)
salary <- group_by(salary, yearID, teamID) %>% summarise(dollars = sum(salary))
team_salary <- left_join(attendance, salary)
## Joining, by = c("yearID", "teamID")
## Warning: Column `teamID` joining factors with different levels, coercing to
## character vector
team_salary <- subset(team_salary, yearID > 1984) #salary data starts at 1984

If any of us are believing that the money is paid for all the high-performing athletes, it isn’t necessarily the case. It is not drawn based on attendance either. There are many factors that determine the high pay for attracting the best talent. It could be based on their performance on defense as well. There are many examples to determine it.

Team Salary vs. Attendance

ggplot(team_salary, aes(dollars, attendance)) + geom_point(color="darkgreen",size=4, shape=36) + 
  labs(x="Team Salary", y="Attendance") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) +
  ggtitle("Team Salary & Attendance")+
  geom_smooth(method = "lm", col = "blue")

In this visual, it has been seen that for the maximum attendance that is around 4.5 million people, there has been a lower team salary, probably because the franchise team had a bad season occurring in factors such as the team management or injuries. For the maximum team salary, it turns out that there is at least an attendance of over 3 million people.

Creating the Moneyball Information (From 1969)

creating a New Variable for Run Difference

attendance$RD <- attendance$R - attendance$RA

Relationship between wins and run difference for Oakland A’s

ggplot(data = attendance) + geom_smooth(mapping = aes(x = W, y = RD), color="darkblue") + 
  ggtitle('Relationship Between Wins and Run Difference') + labs(x="Wins", y="Runs Difference")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Creating a Linear Regression Model for Run Difference Oakland A’s

wins_reg <- lm(W ~ RD, data = attendance)
summary(wins_reg)
## 
## Call:
## lm(formula = W ~ RD, data = attendance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.709  -2.045   0.919   3.896  13.882 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 79.558544   0.187490  424.33   <2e-16 ***
## RD           0.102752   0.001871   54.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.666 on 1262 degrees of freedom
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.7048 
## F-statistic:  3017 on 1 and 1262 DF,  p-value: < 2.2e-16

Histogram and qqplots for wins_reg

hist(rnorm(wins_reg$residuals), col = "red", main = "Wins for Run Difference", xlab = "Residuals")

QQ Plots for Wins_Reg

qqnorm(wins_reg$residuals)
qqline(wins_reg$residuals, col = "green")

Sample Mean Calculations for Wins Reg with Rnorm

winsregrx <- rnorm(wins_reg$residuals)
fitdistr(winsregrx, "normal")
##       mean           sd     
##   -0.01071004    1.02530327 
##  ( 0.02883891) ( 0.02039219)
wrfit <- fitdistr(winsregrx, "normal")
wrfit$estimate
##        mean          sd 
## -0.01071004  1.02530327

Sample Mean for Wins Reg

sum(wins_reg$residuals)/length(wins_reg$residuals)
## [1] -1.695198e-17

Shapiro-Wilk Test for Wins Reg

wrnd <- rnorm(wins_reg$residuals)
shapiro.test(wrnd)
## 
##  Shapiro-Wilk normality test
## 
## data:  wrnd
## W = 0.99819, p-value = 0.1919

Based on the information between the Wins and Runs Difference, this is the equation that is came up for the linear regression mode:

“w = 78.957265 + 0.102115(RD) = 78.957265 + 0.102115(135) ~ 93

Using 135 as a Predictor for Oakland A’s

It was predicted at 93 runs given in 2002 for the team concentrated on the Oakland A’s performance. It is still one of the predictors in the baseball statistics. In addition with the qqplots and histograms, it turns out that there is a strong correlation to its sample as well on top of predictions. With the p-value predicted that is > 0.05, we can fail to reject the null hypothesis. Sample mean is not equal to zero. The Shapiro-Wilk test shows that we could reject the null hypothesis as to the t-test.

Linear Regression Model for Scored Runs for Oakland A’s

runs_scored <- lm(R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance)
summary(runs_scored)
## 
## Call:
## lm(formula = R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.974 -14.334  -0.169  14.181  75.824 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.418867   9.469745   0.361   0.7181    
## AB          -0.108004   0.003312 -32.615  < 2e-16 ***
## H            0.627810   0.012241  51.287  < 2e-16 ***
## X2B          0.199960   0.024100   8.297 2.72e-16 ***
## X3B          0.592950   0.072481   8.181 6.85e-16 ***
## HR           0.956610   0.022378  42.749  < 2e-16 ***
## BB           0.340556   0.009755  34.911  < 2e-16 ***
## SB           0.188601   0.021662   8.707  < 2e-16 ***
## CS          -0.096706   0.055403  -1.745   0.0811 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.45 on 1255 degrees of freedom
## Multiple R-squared:  0.9491, Adjusted R-squared:  0.9488 
## F-statistic:  2926 on 8 and 1255 DF,  p-value: < 2.2e-16

Creating a Logistic Regression Model for Scored Runs for Comparisions

runs_scored <- lm(R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance, family = "binomial")
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded
summary(runs_scored)
## 
## Call:
## lm(formula = R ~ AB + H + X2B + X3B + HR + BB + SB + CS, data = attendance, 
##     family = "binomial")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.974 -14.334  -0.169  14.181  75.824 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.418867   9.469745   0.361   0.7181    
## AB          -0.108004   0.003312 -32.615  < 2e-16 ***
## H            0.627810   0.012241  51.287  < 2e-16 ***
## X2B          0.199960   0.024100   8.297 2.72e-16 ***
## X3B          0.592950   0.072481   8.181 6.85e-16 ***
## HR           0.956610   0.022378  42.749  < 2e-16 ***
## BB           0.340556   0.009755  34.911  < 2e-16 ***
## SB           0.188601   0.021662   8.707  < 2e-16 ***
## CS          -0.096706   0.055403  -1.745   0.0811 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.45 on 1255 degrees of freedom
## Multiple R-squared:  0.9491, Adjusted R-squared:  0.9488 
## F-statistic:  2926 on 8 and 1255 DF,  p-value: < 2.2e-16

Histogram for Runs Scored Residuals

hist(rnorm(runs_scored$residuals), col = "blue", main = "Runs Scored Residuals", xlab = "Residuals")

QQ Plots for Runs Scored Residuals

qqnorm(runs_scored$residuals)
qqline(runs_scored$residuals, col = "salmon")

Sample Mean Calculations for Scored Residuals for Rnorm

runsscoredrx <- rnorm(runs_scored$residuals)
fitdistr(runsscoredrx, "normal")
##        mean             sd      
##   -0.0001699746    0.9689010446 
##  ( 0.0272524711) ( 0.0192704071)
rsfit <- fitdistr(runsscoredrx, "normal")
rsfit$estimate
##          mean            sd 
## -0.0001699746  0.9689010446

Sample Mean for Runs Scored

sum(runs_scored$residuals)/length(runs_scored$residuals)
## [1] -7.358302e-17

Shapiro-Wilk Test for Scored Residuals

srnd <- rnorm(runs_scored$residuals)
shapiro.test(srnd)
## 
##  Shapiro-Wilk normality test
## 
## data:  srnd
## W = 0.99918, p-value = 0.8667

T-Test for Scored Residuals

srtest <- t.test(srnd)
srtest$p.value
## [1] 0.2871109

By comparing the differences utilizing both the linear and logistic regression models, the results turn out to be the same. R-squared equated to 0.9483, which is a really good model. It turns out that the equation from the model equates to:

= 5.177834 + (-0.105735 * AB) + (0.621239 * H) + (0.207906 * X2B) + (0.544204 * X3B) + (0.944925* HR) + (0.341597 * BB) + (0.262176 * SB) + (-0.274041 * CS)

Prediciton Example for the Equation for Oakland A’s

= 5.177834 + (-0.105735 * 5558) + (0.621239 * 1450) + (0.207906 * 279) + (0.544204 * 28) + (0.944925* 205) + (0.341597 * 609) + (0.262176 * 46) + (-0.274041 * 20) = 799.8642

r ~ 800

It turns out that it is coming close to 800 for Oakland A’s in 2002.

Also, by running the statistical tests, it has been proven that we can fail to reject the null hypothesis again as p is > 0.05. Sample mean is not equated to zero for the scored residuals. The Shapiro-Wilk test shows that we could also fail the reject the null hypothesis again as to the t-test.

Runs Allowed Regression

runs_allowed <- lm(R ~ SV + IPouts + H + ER + HR + BB + SO + WP + HBP + BK, data = pitching)
summary(runs_allowed)
## 
## Call:
## lm(formula = R ~ SV + IPouts + H + ER + HR + BB + SO + WP + HBP + 
##     BK, data = pitching)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9190  -1.4762  -0.2931   1.1428  19.6198 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0257224  0.0457194  -0.563   0.5737    
## SV          -0.0057003  0.0031700  -1.798   0.0722 .  
## IPouts      -0.0089103  0.0009841  -9.054  < 2e-16 ***
## H            0.0911030  0.0032199  28.293  < 2e-16 ***
## ER           0.9116464  0.0048153 189.324  < 2e-16 ***
## HR           0.0186677  0.0078987   2.363   0.0181 *  
## BB           0.0576205  0.0027694  20.806  < 2e-16 ***
## SO          -0.0103622  0.0014128  -7.335 2.37e-13 ***
## WP           0.0720413  0.0107940   6.674 2.60e-11 ***
## HBP          0.0489909  0.0109970   4.455 8.47e-06 ***
## BK           0.1631391  0.0263168   6.199 5.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.66 on 11520 degrees of freedom
## Multiple R-squared:  0.9937, Adjusted R-squared:  0.9937 
## F-statistic: 1.814e+05 on 10 and 11520 DF,  p-value: < 2.2e-16

Histogram for Runs Allowed

hist(rnorm(runs_allowed$residuals), col = "yellow", main = "Runs Allowed Residuals", xlab = "Residuals")

QQ Plots for Runs Allowed

qqnorm(runs_allowed$residuals, col = "green")
qqline(runs_allowed$residuals, col = "darkred")

Sample Mean Calculations for Wins Reg

raregrx <- rnorm(runs_allowed$residuals)
fitdistr(raregrx, "normal")
##        mean            sd     
##   -0.021895438    0.998427747 
##  ( 0.009297863) ( 0.006574582)
rafit <- fitdistr(raregrx, "normal")
rafit$estimate
##        mean          sd 
## -0.02189544  0.99842775

Sample Mean for Runs Allowed

sum(runs_allowed$residuals)/length(runs_allowed$residuals)
## [1] -2.570166e-16

T-Test for Wins Reg

testwr <- t.test(runs_allowed$residuals)
testwr$p.value
## [1] 1

By having this equation:

= 0.036056 + (-0.014268 * SV) + (-0.011360 * IPouts) + (0.102181 * H) + (0.896016 * ER) + (0.026607 * HR) + (0.062142 * BB) + (-0.007035 * SO) + (0.075391 * WP) + (0.078287 * HBP) + (0.149150 * BK)

= 0.036056 + (-0.014268 * 48) + (-0.011360 * 4356) + (0.102181 * 1450) + (0.896016 * 593) + (0.026607 * 205) + (0.062142 * 609) + (-0.007035 * 1008) + (0.075391 * 2.353) + (0.078287 * 68) + (0.149150 * 0.5294)

~ 671

By projecting that Oakland A’s would allow 671 runs, however, it has totaled out to 654, which is still accurate for the 2002 season.

Many teams in the Major League baseball concentrated on Batting Average (BA) too much.Oakland A’s discovered that On-Base Percentage (OBP) and Slugging Percentage are the top category for the most important statistical measure. However, we are going to see if that holds true.

Based on the analysis for the p-value and using the t-test, it has proven that we could fail to reject the null hypothesis for the runs allowed variable. Sample mean is not zero at all.

Performance During the Seasons Between 2002-2016

get_wins <- attendance
get_wins$win_ratio <- get_wins$W/get_wins$G
oakland <- subset(get_wins, teamID == 'OAK' & yearID >= 2002)
athletics <- ggplot(data = oakland) + 
  geom_line(mapping = aes(x = yearID, y = win_ratio), color="darkseagreen") + 
  ggtitle("Oakland A's Performance During 2002-2016") + 
  labs(x="year", y="win-ratio") + 
  scale_x_continuous(breaks = round(seq(min(oakland$year), max(oakland$year), by = 1),1))
ggplotly(athletics)

In this part, the model runs at r-squared that is 0.99, which it is almost perfect when there are more than two variables that are described in analyzing the wins, runs, and scores. By comparison with the first two regression models, the runs scored is also works and will be used for analysis, which is at 0.94. However, looking at the first model, the fit is only OK given that r-squared is at 0.71, and it does not always have valid information to make decisions by looking at two variables focusing only on wins attribute. Proven that focusing at least on 3 or more variables would give a better description on making the analyses on the model.

Batting Averages

test<- subset(attendance, yearID >= 1985 & yearID < 2002)
test$obp <- (test$H + test$HBP + test$BB)/(test$AB + test$BB + test$HBP + test$SF)
test$slg <- ((test$BB) + (2 * test$X2B) + (3 * test$X3B) + (4 * test$HR))/(test$AB)
test$ba <- (test$H)/(test$AB)

Comparision of OBP, SLG and BA

ggplot(test, aes(W)) + 
  geom_smooth(aes(y = obp,  color="obp")) + 
  geom_smooth(aes(y = slg,  color="slg")) + 
  geom_smooth(aes(y = ba,  color="ba")) + labs(x="Wins", y="%") + 
  ggtitle("Comparison of OBP, SLG and BA")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 408 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

From the information that is gathered in the above plot, it is deemed that the On-Base Percentage (OBP) and Slugging Percentage (SLG) statistics are more important based on the earlier prediction on the analysis.

Salaries Overtime from 1985-2016

p <- ggplot(player_salary, aes(x=factor(yearID), y=log(salary), fill = lgID)) +
  geom_boxplot() +
  xlab("Year") +
  ylab("Annual Salary in Millions") +
  ggtitle("Breakdown with the League Differences in Salaries from 1985") +
  scale_x_discrete(breaks=unique(player_salary$yearID)[c(TRUE, FALSE)]) +
  scale_y_continuous(label=dollar) +
  theme_light(base_size=10)
print(p)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

Salaries Overtime from 1985-2016 Without the League

For the salaries comparison between the leagues, it has turned out that there has been equal distributions overtime adjusted for inflation. However, it turns out that American League has been getting the higher pay for the last ten years except in 2014.

Batters Overtime from 1985-2016

homeruns_by_year <- batters %>% 
  group_by(yearID) %>% 
  summarise(HR = sum(HR))
p1 <- ggplot(homeruns_by_year, aes(x=yearID, y=HR)) +
  geom_point() +
  xlab("Year") +
  ylab("Home Runs") +
  scale_x_discrete(breaks=unique(homeruns_by_year$yearID)[c(TRUE, FALSE)]) +
  scale_y_continuous() +
  theme_light(base_size=10) + 
  ggtitle("Batters Overtime")
ggplotly(p1)

Looking Up the One Hit Wonders

Consider Only Stats After 1950

hr_since50 <- batters %>% filter(yearID>=1950) %>% select(playerID,yearID,stint,HR);

If a player has multiple “stints” in one season, calculate the total HR for the season.

hr_season_since50 <- hr_since50 %>% group_by(playerID, yearID) %>% summarise(seasonHR=sum(HR));

Calculate career total HR, career season count for players. Remove players who appeared for lesser than 5 seasons.

hr_season_since50 <- hr_season_since50 %>% group_by(playerID) %>% mutate(careerHR=sum(seasonHR),career_year=n()) %>% filter(career_year>=5);

Calculate career average HR per season for each player. Calculate the ratio of single year HR to career average.

yearly_hr_vs_career_avg <- hr_season_since50 %>% mutate(career_per_season=careerHR/career_year, year_vs_avg=seasonHR/career_per_season);

Joining Player Data for Birth and Year

hr_fluke_ratio_and_age <- inner_join(yearly_hr_vs_career_avg, player_info, by="playerID") %>% 
  mutate(age=yearID-birthYear)%>% 
  select(playerID, nameFirst, nameLast, yearID, age, seasonHR, careerHR, career_year, career_per_season, year_vs_avg) %>% 
  ungroup();
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector

Keep the rows for which the player hit more than 2x HRs than career average, and HR>=10 on that year.

out<-filter(hr_fluke_ratio_and_age, seasonHR>=10, year_vs_avg>=2) %>% arrange(desc(year_vs_avg));

Most Deviation Form on a Data Table

datatable(out, class='compact');

Most Home Runs in One Season for a Data Table

out2 <- out %>% arrange(desc(seasonHR));
datatable(out2, class='compact');

Visualization of the One-Hit Wonders

onehit <- ggplot(out, aes(age,year_vs_avg)) + 
  geom_point(color = "blue", alpha = 0.4) + 
  labs(x="Player Age", y="Season HR per Career Average") + 
  ggtitle("Players With One Season HR More Than 2x Career Average") + 
  theme_minimal();
ggplotly(onehit)

Based on the visualization, it turns out that the late 20s to the mid 30s are the players that have had a great season for the one-hit wonders. The data is compared from the Home Run per season vs. the career. Outliers happen more after the late to mid 30s.

Taking a Look at the Closers

options(dplyr.width = Inf)

Looking at Data After 1985

pitching <- pitching  %>% filter(yearID>=1985) %>% 
  select(playerID,yearID,stint,W,L,G,GS,GF,SV,IPouts,H,BB,ER,SO,R)

Calculating the Totals For the Season Based On Stints

pitching <- pitching  %>% 
  group_by(playerID, yearID) %>% 
  summarise(W=sum(W), L=sum(L), G=sum(G), GS=sum(GS),GF=sum(GF),SV=sum(SV),IPouts=sum(IPouts),H=sum(H),BB=sum(BB),ER =sum(ER),SO=sum(SO),R=sum(R)) %>% 
  ungroup();

Players Not Started this Year

rp_pitching <- pitching %>% filter(GS==0);

Getting player Names and Information

player_info <- player_info %>% 
  separate(debut, c("debut"), sep="-") %>% 
  select(playerID, birthYear, nameFirst, nameLast, debut)
## Warning: Too many values at 18653 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
## 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
rp_pitching <- inner_join(rp_pitching, player_info, by="playerID")%>% 
  mutate(id_year=paste(playerID, yearID, sep="_"))
## Warning: Column `playerID` joining character vector and factor, coercing
## into character vector
player_salary <- player_salary %>% 
  mutate(id_year = paste(playerID, yearID, sep="_")) %>% 
  select(id_year,salary)
rp_pitching <- inner_join(rp_pitching, player_salary, by="id_year");

Percentage of Finishing Off

rp_pitching<-rp_pitching %>% 
  mutate(gf_ratio=GF/162, gf_percent=GF/G)

Calculating the Mean Salary Annually and Analyzing the WHIP (Walk + Hit Per Inning Pitched)

rp_pitching<-rp_pitching %>% 
  group_by(yearID) %>% 
  mutate(season_avg_salary=mean(salary)) %>% 
  ungroup() %>% 
  mutate(salary_ratio=salary/season_avg_salary, season=yearID-as.numeric(debut), whip=(H+BB)/IPouts*3, era=ER/IPouts*27);

Replacing the Datapoints with NA

rp_pitching<-do.call(data.frame,lapply(rp_pitching, function(x) replace(x, is.infinite(x),NA)))

Looking at Salaries Per Peformance

cor(rp_pitching$salary_ratio, rp_pitching$whip,use="pairwise.complete")
## [1] -0.1012409
cor(rp_pitching$salary_ratio, rp_pitching$era,use="pairwise.complete")
## [1] -0.09562511

Who are the Closers?

closers<- rp_pitching %>% filter(SV > 15);
nonclosers<- rp_pitching %>% filter(SV <= 15);
fa_closers <- closers %>% filter(season>=7&IPouts>=150);
fa_nonclosers <- nonclosers %>% filter(season>=7&IPouts>=150);
nrow(fa_closers)
## [1] 286

Non-Closers

nrow(fa_nonclosers)
## [1] 633

Based on this threshold, we have 315 FA closer seasons vs. 715 FA Non-Closer seasons. It has been determined that there needs to be more analysis by changing the limits on more than one save per season.

Boxplots of Closers vs. Non-Closers

boxplot(fa_closers$gf_percent*100, fa_nonclosers$gf_percent*100,  
        names=c("Closers", "Non-Closers"), 
        ylab="% of Finishing a Game (Being Last Pitcher In Game)",
        main = "Closers vs. Non-Closers",
        col = "lightcoral")

summary(fa_closers$gf_percent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3544  0.8143  0.8646  0.8432  0.9000  1.0000
summary(fa_nonclosers$gf_percent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05263 0.20270 0.28750 0.30963 0.39216 0.88095

By looking at the boxplots, closers finished the game 83% of the time, while the non-closers only finish in the 31% of the time. The information is based on the difference with 15 or more saves per season.

Last Pitcher in Game to Finish Up

plot(density(fa_closers$gf_percent*100), xlim=c(0,100), col="blue", xlab="% of Finishing a Game (Being Last Pitcher in Game)", main="")
lines(density(fa_nonclosers$gf_percent*100), col="darkred")
legend(0,0.06,c("Non-Closers", "Closers"),c("darkred", "blue"))

Closers vs. Non-Closers Performance

summary(fa_closers$whip)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6069  1.0307  1.1646  1.1791  1.3306  1.8068
summary(fa_nonclosers$whip)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.773   1.189   1.305   1.315   1.432   1.962
summary(fa_closers$salary_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2226  2.0451  3.1861  3.4617  4.5667  9.0269
summary(fa_nonclosers$salary_ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1409  0.6725  1.2106  1.3963  1.9288  8.3391

WHIP Difference

boxplot(fa_closers$whip, fa_nonclosers$whip, 
        names=c("Closers", "Non-Closers"), ylab="WHIP (Hit+Walk per Inning)",
        main = "Difference of WHIP",
        col = "darkolivegreen")

In this analysis on the WHIP difference, it turns out that the non-closers have made more of an impact than the closers. Although, when it comes to the salary ratio, the closers get paid more, particularly the relief pitchers by difference of per inning.

boxplot(fa_closers$salary_ratio, fa_nonclosers$salary_ratio, 
        names=c("Closers", "Non-Closers"), ylab="Salary / Avg. Relief Pitcher Salary",
        main = "Salary Ratio on WHIP Differences",
        col = "goldenrod")

Scatterplot Differences

plot(fa_closers$salary_ratio,fa_closers$whip, 
     xlab="Salary / Avg. Relief Pitcher Salary", ylab="WHIP (Hit+Walk per Inning)", 
     main = "Difference Between the Salary Ratio on Relief Pitchers",
     xlim=c(0,10), ylim=c(0.6,2.0), col = "green", pch = 19)
points(fa_nonclosers$salary_ratio, fa_nonclosers$whip, col=2, pch = 19)
legend(7,2.0,c("Closers", "Non-Losers"),c("green", "red"))

low_whip_closers<-fa_closers %>% filter(whip<=1.20)
low_whip_nonclosers<-fa_nonclosers %>% filter(whip<=1.20)
nrow(low_whip_closers)
## [1] 164
nrow(low_whip_nonclosers)
## [1] 166

By looking at the scatterplot of the relief pitcher salary, it has been determined that the closers have a migh higher pay based on the green dots that are in the right side of the distribution of the analysis. There are more red dots on the left, and at most get up to the middle of the distribution. The one outlier for the non-closer must have to be the best player in the league in order to achieve that status to receive the highest pay in the league.

Low- WHIP Nonclosers

boxplot(low_whip_closers$whip, low_whip_nonclosers$whip, 
        names=c("Closers", "Non-Closers"), ylab="WHIP (Hit+Walk per Inning)",
        main = "Low WHIP Non-Closers",
        col = "skyblue")

Difference with Relief-Pitchers

boxplot(low_whip_closers$salary_ratio, low_whip_nonclosers$salary_ratio, 
        names=c("Closers", "Non-Closers"), ylab="Salary / Avg. Relief Pitcher Salary",
        main = "Difference with Relief Pitchers",
        col = "maroon")

plot(low_whip_closers$salary_ratio,low_whip_closers$whip, 
     xlab= "Salary / Avg. Relief Pitcher Salary", ylab="WHIP (Hit+Walk per Inning)",
     main = "Difference of Salaries",
     xlim=c(0,10), ylim=c(0.6,1.3), col = "blue", pch = 16)
points(low_whip_nonclosers$salary_ratio, low_whip_nonclosers$whip, col=2, pch = 16)
legend(7,1.3,c("Closers", "Non-Losers"),c("blue", "red"))

Again, by looking at the plots again, it has been determined that the closing relief pitchers have been making more closing moments in the game headed towards the right in the distribution points that are shaded in blue. Same comparison as with the previous visuals with the walks per hit and inning ratio.

Low-ERA Closers

low_era_closers<-fa_closers %>% filter(era<=3.0)
low_era_nonclosers<-fa_nonclosers %>% filter(era<=3.0)
nrow(low_era_closers)
## [1] 161
nrow(low_era_nonclosers)
## [1] 172

Comparisons with ERA on Closers vs. Non-Closers

boxplot(low_era_closers$era, low_era_nonclosers$era, 
        names=c("Closers", "Non-Closers"), ylab="ERA (ER per 9 Innings)",
        main = "Comparisons with ERA on Closers vs. Non-Closers",
        col = "red")

By looking at the boxplots again for closers and non-closers, it has seemed that the closers have made more salary when the game goes down the wire by playing all 9 innings and closing down the game. Again, this is the impact with the relief pitchers.

boxplot(low_era_closers$salary_ratio, low_era_nonclosers$salary_ratio, 
        names=c("Closers", "Non-Closers"), ylab="Salary / Avg. Relief Pitcher Salary",
        main = "Salary Ratio Between the Closers vs. Non-Closers",
        col = "khaki")

plot(low_era_closers$salary_ratio,low_era_closers$era, 
     xlab="Salary / Avg. Relief Pitcher Salary", ylab="ERA (ER per 9 Innings)", 
     main = "Salaries for ERA Comparisons",
     xlim=c(0,10), ylim=c(0.6,3.0), col = "gold", pch = 16)
points(low_era_nonclosers$salary_ratio, low_era_nonclosers$era, pch = 16, col = "green")
legend(7,1.0,c("Closers", "Non-Losers"),c("gold", "green"))

For the concluding analyses on the closers, it turns out that the closers make more money than the non-closers based on the given performance on having more saves to end the game. However, it is important to secure leads during late in the game to earn victories, and not playing extra innings.

Errors History

Sum the errors By yearID

errors_by_year = tapply(fieldING$E, fieldING$yearID, sum)

Make the Year as a Factor

years = factor(fieldING$yearID)
years = levels(years)

New Dataset with Total number of Errors by Year

ebyt2 = cbind(errors_by_year, years)
ebyt2 = data.frame(ebyt2)

Plotting the Errors by Year

ggplot(ebyt2, aes(years, errors_by_year)) + 
  geom_bar(stat = 'identity', col = "darkred", fill = "skyblue") + 
  labs(title="Errors By Year Since 1920", x = "Years", y = "Errors By Year") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_y_discrete(breaks = pretty(errors_by_year, n=10))

The data is all chronological for the errors that are done based on the r console. However, the focus is going to look at 1920 to present, since the analyses before is not as much of a factor for errors.

Looking & Plotting Errors by Position

atopper = ddply(fieldING, "POS", summarise, errors_by_position = sum(E))

ggplot(atopper, aes(POS, errors_by_position)) + 
  geom_bar(stat='identity', fill = "red") + 
  labs(title="Errors by Position Since 1920", x = "Position", y = "Total # of Errors by Position")

Based on position for errors in baseball, it turns out that Shortstop makes the most errors in the game. It is not a surprise as they could miss catches, since they are at a closer to the 2nd Base and a distance from 3rd as well. Possibilities are also the miss throws when the receiver couldn’t catch the ball, also dropped balls as well.

Looking & Plotting Errors by Team

btopper = ddply(fieldING, "teamID", summarise, errors_by_team = sum(E))


ggplot(btopper, aes(teamID, errors_by_team)) + 
  geom_bar(stat='identity', fill = 'violet') + 
  labs(title = "Errors by Team Since 1920", x = "Team", y = "Total # of Errors")

Conclusion:

I have learned the importance of tackling datasets that involves with having multiple spreadsheets from the information that is given. Baseball is a sport that can have unpredictability when it comes to the closers, where they have made more of an impact of winning for the team.I have taken the opportunity to expand the visuals and to implement more statistical practices on using the dataset. Utilized cleaning on the datasets along the way, and not doing it all at one setting. By visualizing the data from the past, it turns out that it is possible to look at some statistics from the beginning. Most of it has more of an impact in the last 30 years when it comes to performance and salaries. Also, would want to have better identification on which model would be the best for further analyses.

Future Work:

Would want to put it on a Shiny App if possible. Take more of different methods from what I have learned and to go beyond. Also would do differences with the interaction with the regression models that I have created as well.